

Chapter 3

Missing Data

3.1 Types of Missing Data

• Missing completely at random (MCAR)
• Missing at random (MAR) (a)
• Informative missing (non-ignorable non-response)

See [1, 38, 59] for an introduction to missing data and imputation concepts.

(a) Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models.

3.2 Prelude to Modeling

• Quantify extent of missing data
• Characterize types of subjects with missing data
• Find sets of variables missing on same subjects

3.3 Missing Values for Different Types of Response Variables

• Serial data with subjects dropping out (not covered in this course)
• Y = time to event, follow-up curtailed: covered under survival analysis (b)
• Often discard observations with completely missing Y but sometimes wasteful
• Characterize missings in Y before dropping obs.

(b) White and Royston [131] provide a method for multiply imputing missing covariate values using censored survival time data.

3.4 Problems With Simple Alternatives to Imputation

Deletion of records

• Badly biases parameter estimates when the probability of a case being incomplete is related to Y and not just X [82].
• Deletion because of a subset of X being missing always results in inefficient estimates
• Deletion of records with missing Y can result in biases [30] but is the preferred approach under MCAR (c)
• However, von Hippel [126] found advantages to a "use all variables to impute all variables then drop observations with missing Y" approach
• Only discard obs. when
  - MCAR can be justified
  - Rarely missing predictor of overriding importance that can't be imputed from other data
  - Fraction of obs. with missings small and n is large
• No advantage of deletion except savings of analyst time
• Making up missing data better than throwing away real data

(c) Multiple imputation of Y in that case does not improve the analysis and assumes the imputation model is correct.

Adding extra categories of categorical predictors

• Including missing data but adding a category "missing" causes serious biases [1, 71, 117]
• Problem acute when values missing because subject too sick
• Difficult to interpret
• Fails even under MCAR [1, 38, 71, 119]
• May be OK if values are missing because of "not applicable", e.g. you have a measure of marital happiness, dichotomized as high or low, but your sample contains some unmarried people. OK to have a 3-category variable with values high, low, and unmarried. (d)

(d) Paul Allison, IMPUTE list, 4Jul09

Likewise, serious problems are caused by setting missing continuous predictors to a constant (e.g., zero) and adding an indicator variable to try to estimate the effect of missing values.

Two examples from Donders et al. [38] using binary logistic regression, N = 500.

Results of 1000 Simulations With $\beta_1 = 1.0$ with MAR and Two Types of Imputation

Imputation Method    $\hat{\beta}_1$    S.E.    Coverage of 0.90 C.I.
Single
Multiple

Now consider a simulation with $\beta_1 = 1$, $\beta_2 = 0$, $X_2$ correlated with $X_1$ (r = 0.75) but redundant in predicting Y, use missingness indicator when $X_1$ is MCAR in 0.4 of 500 subjects. This is also compared with grand mean fill-in imputation.

Results of 1000 Simulations Adding a Third Predictor Indicating Missing for $X_1$

Imputation Method    $\hat{\beta}_1$    $\hat{\beta}_2$
Indicator
Overall mean         0.55

3.5 Strategies for Developing Imputation Algorithms

The goal of imputation is to preserve the information and meaning of the non-missing data.

Exactly how are missing values estimated?

• Could ignore all other information: random or grand mean fill-in
• Can use external info not used in response model (e.g., zip code for income)
• Need to utilize reason for non-response if possible
• Use statistical model with sometimes-missing X as response variable
• Model to estimate the missing values should include all variables that are either
  1. related to the missing data mechanism;
  2. have distributions that differ between subjects that have the target variable missing and those that have it measured;
  3. associated with the sometimes-missing variable when it is not missing; or
  4. included in the final response model [6, 59]
• Ignoring imputation results in biased $\hat{V}(\hat{\beta})$
• transcan function in Hmisc library: optimal transformations of all variables to make residuals more stable and to allow non-monotonic transformations
• aregImpute function in Hmisc: good approximation to full Bayesian multiple imputation procedure using the bootstrap
• aregImpute and transcan work with fit.mult.impute to make final analysis of response variable relatively easy
• Predictive mean matching [82]: replace missing value with observed value of subject having closest predicted value to the predicted value of the subject with the NA
  - PMM can result in some donor observations being used repeatedly
  - Causes lumpy distribution of imputed values
  - Address by sampling from multinomial distribution, probabilities = scaled distance of all predicted values to predicted value ($y^*$) of observation needing imputing
  - Tukey's tricube function is a good weighting function (used in loess): $w_i = (1 - \min(d_i/s, 1)^3)^3$, $d_i = |\hat{y}_i - y^*|$
  - $s = 0.2 \times \mathrm{mean}\,|\hat{y}_i - y^*|$ is a good default scale factor
  - scale so that $\sum w_i = 1$
• Recursive partitioning with surrogate splits handles case where a predictor of a variable needing imputation is missing itself

3.6 Single Conditional Mean Imputation

• Can fill-in using unconditional mean or median if number of missings low and X is unrelated to other Xs

• Otherwise, first approximation to good imputation uses other Xs to predict a missing X
• This is a single "best guess" conditional mean
• $\hat{X}_j = Z\hat{\theta}$, $Z = X_{\bar{j}}$ (the predictors other than $X_j$)
  Cannot include Y in Z without adding random errors to imputed values (would steal info from Y)
• Recursive partitioning is very helpful for nonparametrically estimating conditional means

3.7 Multiple Imputation

• Single imputation using a random draw from the conditional distribution for an individual
  $\hat{X}_j = Z\hat{\theta} + \hat{\epsilon}$, $Z = [X_{\bar{j}}, Y]$
  $\hat{\epsilon} = n(0, \hat{\sigma})$ or a random draw from the calculated residuals
  - bootstrap
  - approximate Bayesian bootstrap [59, 100]: sample with replacement from sample with replacement of residuals
• Multiple imputations (M) with random draws
  - Draw sample of M residuals for each missing value to be imputed
  - Average M $\hat{\beta}$
  - In general can provide least biased estimates of $\beta$
  - Simple formula for imputation-corrected var($\hat{\beta}$): function of average "apparent" variances and between-imputation variances of $\hat{\beta}$
  - BUT full multiple imputation needs to account for uncertainty in the imputation models by refitting these models for each of the M draws
  - transcan does not do that; aregImpute does
• Note that multiple imputation can and should use the response variable for imputing predictors [87]; it should also use auxiliary variables not intended to be in the response model
• aregImpute algorithm [87]
  - Takes all aspects of uncertainty into account using the bootstrap
  - Different bootstrap resamples used for each imputation by fitting a flexible additive model on a sample with replacement from the original data
  - This model is used to predict all of the original missing and non-missing values for the target variable for the current imputation
  - Uses flexible parametric additive regression models to impute
  - There is an option to allow target variables to be optimally transformed, even non-monotonically (but this can overfit)
  - Uses predictive mean matching for imputation; no residuals required
  - By default uses weighted PMM; option for just using closest match
  - When a predictor of the target variable is missing, it is first imputed from its last imputation when it was a target variable
  - First 3 iterations of process are ignored ("burn-in")
  - Compares favorably to S MICE approach

Example:

    library(Hmisc)  # aregImpute, fit.mult.impute
    library(rms)    # lrm, rcs
    a <- aregImpute(~ age + sex + bp + death, data=mydata, n.impute=5)
    f <- fit.mult.impute(death ~ rcs(age,3) + sex + rcs(bp,5), lrm, a, data=mydata)

See Barzi and Woodward [6] for a nice review of multiple imputation with detailed comparison of results (point estimates and confidence limits for the effect of the sometimes-missing predictor) for various imputation methods. Barnes et al. [5] have a good overview of imputation methods and a comparison of bias and confidence interval coverage for the methods when applied to longitudinal data with a small number of subjects. Horton and Kleinman [68] have a good review of several software packages for dealing with missing data, and a comparison of them with aregImpute. Harel and Zhou [59] provide a nice overview of multiple imputation and discuss some of the available software.

Caution: Methods can generate imputations having very reasonable distributions but still not having the property that final response model regression coefficients have nominal confidence interval coverage. At present, regression imputation in aregImpute has excellent coverage but predictive mean matching yields confidence intervals that are a bit too narrow. (e)

(e) One area for future research is to check that imputations have the correct collinearities with other covariates.

• With MICE and aregImpute we are using the chained equation approach
• Chained equations handles a wide variety of target variables to be imputed and allows for multiple variables to be missing on the same subject
• Iterative process cycles through all target variables to impute all missing values [118]
• Does not attempt to use the full Bayesian multivariate model for all target variables, making it more flexible and easy to use
• Possible to create improper imputations, e.g., imputing conflicting values for different target variables
• However, simulation studies [118] demonstrate very good performance of imputation based on chained equations

3.8 Diagnostics

• MCAR can be partially assessed by comparing distribution of non-missing Y for those subjects with complete X vs. those subjects having incomplete X [82]
• Yucel and Zaslavsky [138]
• Interested in reasonableness of imputed values for a sometimes-missing predictor $X_j$
• Duplicate entire dataset
• In the duplicated observations set all non-missing values of $X_j$ to missing; let w denote this set of observations set to missing
• Develop imputed values for the missing values of $X_j$
• In the observations in w compare the distribution of imputed $X_j$ to the original values of $X_j$
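The duplication check in Section 3.8 can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the names d, dup, stacked, and z are hypothetical, and a linear regression of x on z with a random residual draw stands in for whatever imputation model is actually used (e.g., aregImpute). It is not the Yucel and Zaslavsky procedure itself.

```r
set.seed(1)
n <- 200
z <- rnorm(n)                            # always-observed covariate (hypothetical)
x <- 2 + 0.8 * z + rnorm(n, sd = 0.5)    # sometimes-missing predictor X_j
x[sample(n, 40)] <- NA                   # make 40 values missing
d <- data.frame(x = x, z = z)

# Duplicate the entire dataset; in the copy, set all non-missing x to missing.
dup <- d
w <- which(!is.na(d$x))                  # w = observations set to missing
dup$x[w] <- NA
stacked <- rbind(d, dup)

# Stand-in imputation model (assumption): regression prediction plus a
# residual drawn with replacement. lm() drops rows with missing x.
fit <- lm(x ~ z, data = stacked)
pred <- predict(fit, newdata = dup[w, ])
imputed <- pred + sample(residuals(fit), length(w), replace = TRUE)

# In the observations in w, compare imputed values with the original values.
original <- d$x[w]
round(rbind(original = quantile(original), imputed = quantile(imputed)), 2)
```

Large discrepancies between the two rows of quantiles would suggest that the imputation model does not reproduce the distribution of the observed $X_j$; a plot of the two distributions serves the same purpose.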

3.9 Summary and Rough Guidelines

Table 3.1: Summary of Methods for Dealing with Missing Values

Method                                  Deletion   Single   Multiple
Allows non-random missing                            x         x
Reduces sample size                        x
Apparent S.E. of $\hat{\beta}$ too low               x
Increases real S.E. of $\hat{\beta}$       x
$\hat{\beta}$ biased if not MCAR           x

The following contains very crude guidelines. Simulation studies are needed to refine the recommendations. Here "proportion" refers to the proportion of observations having any variables missing.

Proportion of missings ≤ 0.05: Method of imputing and computing variances doesn't matter much

Proportion of missings 0.05 to 0.15: Constant fill-in if predictor unrelated to other Xs. Single "best guess" imputation probably OK. Multiple imputation doesn't hurt.

Proportion of missings > 0.15: Multiple imputation, adjust variances for imputation

Multiple predictors frequently missing: More important to do multiple imputation and also to be cautious that imputation might be ineffective.

Reason for missings more important than number of missing values.
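As a rough illustration of how the proportion used in these guidelines can be computed, the base R sketch below uses a small made-up data frame. The variable names and the guideline() helper are hypothetical; the helper only restates the crude cutoffs given above and is not a substitute for judgment about why values are missing.

```r
# Hypothetical data frame with scattered missing values
d <- data.frame(
  age = c(50, NA, 61, 47, 70, 55, 38, 66, 59, 44),
  sex = c("f", "m", "m", NA, "f", "f", "m", "f", "m", "f"),
  bp  = c(120, 135, NA, 110, 142, 128, 117, 150, 133, 125)
)

# Extent of missing data: fraction missing for each variable
colMeans(is.na(d))

# Proportion of observations having any variable missing
p <- mean(!complete.cases(d))

# Crude lookup restating the cutoffs from the text (helper name is made up)
guideline <- function(p) {
  if (p <= 0.05) {
    "method of imputing and computing variances matters little"
  } else if (p <= 0.15) {
    "single best-guess imputation probably OK"
  } else {
    "multiple imputation, adjust variances for imputation"
  }
}
guideline(p)
```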

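The variance adjustment for multiple imputation is described in Section 3.7 as a function of the average apparent variances and the between-imputation variance of the estimates. The form in which this is commonly written (usually attributed to Rubin) is given below for a single coefficient. The symbols $\hat{\beta}_m$, $\hat{V}_m$, $\bar{W}$, and $B$ are introduced here for illustration and are not notation from these notes: $\hat{\beta}_m$ and $\hat{V}_m$ denote the estimate and its apparent variance from the m-th completed dataset.

```latex
\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}_m, \qquad
\bar{W} = \frac{1}{M}\sum_{m=1}^{M}\hat{V}_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\beta}_m - \bar{\beta}\right)^2,
\qquad
\widehat{\mathrm{var}}(\bar{\beta}) = \bar{W} + \left(1 + \frac{1}{M}\right) B
```

When the imputations barely change the estimates, $B$ is near zero and the corrected variance is close to the average apparent variance; the more the estimates vary across imputations, the larger the correction.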